of a particular gene (and perhaps their co-occurrence with mentions of a particular
disease). The unorthodox usage of language in many contemporary papers is difficult
enough for a human reader to interpret, let alone for artificial intelligence. 16 Hence,
whether the results of such mining will actually be useful remains an open question. There
appear to be no current attempts to weight the value of the “ore” according to some
assessment of the reliability of the facts reported and the assertions made. These
difficulties must, however, be weighed against the general growth in understanding
that is, one hopes, taking place. The edifice of reliable knowledge gradually being
erected from the bricks supplied by individual laboratories allows inferences to be
made at an increasingly high level, and these might well render largely superfluous
endless automated reworking of the mass of facts and purported facts reported in the
primary research literature.
One area in which it seems likely that something interesting could emerge is the
search for clumps or clusters of objects (which might be words, phrases, or even whole
documents) for which there is no preexisting term to describe them. Such a search
might be based on a rather abstract measure of relevance (which must, of course, be
judiciously chosen), along the lines suggested by Good (1962), and adumbrated in
Sect. 13.2. This would be very much in the spirit of the clusters emerging when the
frequencies of n-grams in DNA are examined (cf. Sect. 17.6).
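To make the idea concrete, the following Python sketch (illustrative only: the toy sequences, the choice of 3-grams, and the preset number of clusters are assumptions made for the example, not part of the original suggestion) groups short DNA-like strings by their n-gram frequency profiles:

# Illustrative sketch: cluster sequences by n-gram (k-mer) frequency
# profiles, in the spirit of the clusters mentioned in Sect. 17.6.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

sequences = [
    "ATGCGTATGCGT", "ATGCGTTTGCGT",   # two similar sequences
    "GGGCCCGGGCCC", "GGGCCCGGCCCC",   # a second, distinct group
]

# Represent each sequence by its overlapping 3-gram counts.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3))
X = vectorizer.fit_transform(sequences)

# Group the frequency profiles; here the number of clusters (2) is
# assumed, whereas the interesting case discussed in the text is the
# emergence of groupings for which no preexisting term exists.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(dict(zip(sequences, labels)))

In the situation envisaged above, of course, neither the number nor the nature of the clusters would be known in advance; the interest lies precisely in clumps that no existing term describes.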
If, indeed, knowledge representation moves toward probability distributions
(Sect. 31.3), it would be of great value if text mining could deliver quantitative
appraisals of the uncertainties of reported experimental results, which would have to
include an assessment of the entire framework of the experiment (cf. Sect. 6.1.1)—
that is, the structural information, as well as of the metrical information gained from
the individual measurements (cf. Table 6.1). We seem to be rather far from achieving
this automatically at present, but the goal merits the strongest efforts, for without
such a capability, we risk being condemned to ever more fragmented knowledge,
which, as a body, is increasingly shot through with internal contradictions. 17
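As a very rough illustration of what such a quantitative appraisal might look like (the class, the field names, and the square-root weighting below are purely hypothetical conventions, not an established method), a reported result could be stored as a distribution whose spread reflects both the metrical and the structural information:

# Hypothetical sketch: a reported value carries its quoted (metrical)
# uncertainty together with an assessed confidence in the experimental
# framework as a whole (cf. Sect. 6.1.1); the latter widens the
# distribution rather than being discarded.
from dataclasses import dataclass
import math

@dataclass
class ReportedResult:
    mean: float             # value as reported in the paper
    metrical_sd: float      # uncertainty of the measurement itself
    framework_score: float  # 0..1 confidence in the experimental framework

    def effective_sd(self) -> float:
        # Heuristic: lower framework confidence inflates the spread.
        return self.metrical_sd / math.sqrt(max(self.framework_score, 1e-6))

r = ReportedResult(mean=4.2, metrical_sd=0.3, framework_score=0.5)
print(f"reported value ~ N({r.mean}, {r.effective_sd():.2f}^2)")

The only point of the sketch is that the structural assessment enters the representation explicitly, instead of being lost once the numerical value has been extracted from the text.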
31.6 The Automation of Research
Much of the laboratory work required for high-throughput genomics can be automated
and carried out by laboratory robots according to a strictly executed set of
instructions. In many ways this is better than carrying out the manipulations manually:
the robot is likely to be able to execute its instructions more uniformly and
reliably than a human experimenter. It also has the advantage that a comprehensive
16 This is mainly a consequence of the fact that papers are overwhelmingly written in English,
which is nowadays not the native tongue of most scientists, together with the reluctance of
open-access publishers to spend money on editing.
17 Tensor factorization analysis is an encouraging movement towards more precision in text mining
(see Roy et al. 2017 for an application to transcription factors; note that this work confines itself to
analysing the abstracts of papers rather than the full texts—a corollary of which is that all publishers
should strive to ensure that the greatest possible care is taken over the integrity of abstracts).